Data Science Applications to Politics Research

Week 2 | GV330 | Lecture 2

Zach Dickson

London School of Economics

Introduction

Overview of the Lecture

  • Modes of Statistical Inference

    • Descriptive inference
    • Predictive inference
    • Causal inference
  • “Big Data” and Causal Inference

    • Limits
    • Opportunities
  • Credibility Crisis in the Social Sciences

    • Reproducibility
    • Transparency
    • Open Science

Three Modes of Statistical Inference

  • Descriptive:
    • Inferring ‘ideal points’ of political actors from roll call votes
    • Inferring elites’ ‘attention’ from social media messages
    • Inferring economic development from satellite images
  • Predictive:
    • Inferring election outcomes from polls
    • Inferring individual behavior from aggregate data
    • Inferring future states of the world from past data
  • Causal:
    • Inferring the effect of a policy on voter turnout
    • Inferring the impact of elite messages on public behavior
    • Inferring the influence of social media use on political polarization

Descriptive Inference Example

Adam Bonica (2014): Mapping the Ideological Marketplace

  • Creates a method of measuring the ideological positions of political actors based on their campaign contributions

  • Uses data from the Federal Election Commission (FEC), which includes over 100 million individual contributions to political campaigns from 1979–2012

  • Introduces the CFScore

    • Item response theory (IRT) model to estimate the ideological positions of political actors based on their campaign contributions
    • Relies on the assumption that actors contribute to candidates who are ideologically proximate to them
Descriptive Inference Example

Ideological Summary of State Politics (2010)

Predictive Inference Example


Nickerson & Rogers (2014): Political Campaigns and Big Data

Predictive Inference Example

  • Campaigns use data to construct predictive models for more efficient targeting

  • These models produce three types of “predictive scores” for each citizen in the voter database

    • Behavior scores: use past behavior and demographic information to calculate probabilities that citizens turn out, donate, etc.
    • Support scores: survey a sample of citizens about their candidate/issue support. Use to gauge aggregate preferences
    • Responsiveness scores: predict how citizens will respond to campaign outreach using results of randomized field experiments. Model heterogeneous treatment effects and use to predict treatment responsiveness in target population.

Note: predictive scores for responsiveness based on heterogeneous treatment effects can be seen as causal inference. We see this often in the microtargeting literature.
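A behavior score is, at its core, a supervised model of turnout. A minimal sketch on synthetic data (the fields and coefficients below are invented stand-ins for the demographic and vote-history variables campaigns actually hold):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic voter file: age and past turnout count, with turnout
# generated from a known logistic model.
n = 5000
age = rng.uniform(18, 90, n)
past_votes = rng.integers(0, 5, n).astype(float)
true_logit = -2.5 + 0.02 * (age - 50) + 0.8 * (past_votes - 2)
turned_out = rng.random(n) < 1 / (1 + np.exp(-true_logit))

# Fit a logistic "behavior score" model by gradient ascent on the
# log-likelihood, using standardised features, then score every citizen.
X = np.column_stack([np.ones(n),
                     (age - age.mean()) / age.std(),
                     (past_votes - past_votes.mean()) / past_votes.std()])
beta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (turned_out - p) / n

behavior_score = 1 / (1 + np.exp(-X @ beta))  # P(vote) for each citizen
```

The same template covers support scores (swap the outcome for surveyed candidate support) and responsiveness scores (swap it for the treatment effect estimated in a field experiment).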

Predictive Inference Example

‘Big Data’

  • Voter Files
  • Consumer and Property Data
  • Census Data
  • Previous Political Behavior
    • Donating, volunteering, voting

Predictive Inference Example

Notes: The x-axis is likelihood of supporting a Democratic candidate over a Republican candidate, ranging from 0 (left) to 100 (right). The y-axis is likelihood of voting, ranging from 100 (low) to 0 (high).

Causal Inference

Causal Inference

  • Correlations are everywhere, but they do not imply causation
  • Causal inference is about understanding the effect of one variable on another
  • Causal inference requires a counterfactual: what would have happened if the treatment had not been applied?

Causal Inference

  • When we ‘move’ X, what happens to Y?

  • Example:

    • What is the effect of a policy on voter turnout?
    • What is the impact of elite messages on public behavior?
    • What is the influence of social media use on political polarization?

The Fundamental Problem of Causal Inference

We can never observe the counterfactual outcome for the same unit at the same time
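The fundamental problem is easiest to see in simulation, where (unlike in real data) both potential outcomes exist for every unit. All numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Both potential outcomes exist here only because we simulated them;
# real data reveal exactly one per unit.
n = 10000
y0 = rng.binomial(1, 0.30, n)   # turnout if NOT treated
y1 = rng.binomial(1, 0.38, n)   # turnout if treated
d = rng.binomial(1, 0.5, n)     # random treatment assignment
y = np.where(d == 1, y1, y0)    # the one outcome we actually observe

true_ate = (y1 - y0).mean()                    # knowable only in simulation
est_ate = y[d == 1].mean() - y[d == 0].mean()  # what an RCT estimates
```

Randomization is what licenses the difference in means: it makes the treated group's observed outcomes a valid stand-in for the control group's missing counterfactuals.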

Causal Inference Example


Gerber, Green & Larimer (2008): Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment

Causal Inference Example

Theory

  • Voter turnout theories based on rational self-interested behavior generally fail to predict significant turnout unless they account for the utility that citizens receive from performing their civic duty

  • Two aspects of this type of utility: intrinsic satisfaction from behaving in accordance with a norm, and extrinsic incentives to comply with it

  • Gerber, Green, and Larimer (2008) test these motives in a large-scale field experiment, applying varying degrees of extrinsic pressure on voters through a series of mailings to 180,002 households before the August 2006 primary election in Michigan.

    • \(Y_i\): Voter turnout for individual \(i\)
    • \(D_i\): Treatment (type of mailing) for individual \(i\)

Causal Inference Example

Hypotheses

  • Civic Duty:

    • Encouraging citizens to vote by appealing to their sense of civic duty
  • Hawthorne Effect:

    • Encouraging citizens to vote by appealing to their sense of being observed
  • Self:

    • Encouraging citizens to vote by appealing to their sense of self-interest
      • Voting is public record
      • Shown whether members of the household voted in the last election
  • Neighbors:

    • Encouraging citizens to vote by appealing to their sense of social pressure
      • Shown whether neighbors voted in the last election

Causal Inference Example

Experimental Treatment

Gerber, Green & Larimer (2008): Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment

Causal Inference Example

Treatment Groups

Gerber, Green & Larimer (2008): Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment

Causal Inference Example

Covariate Balance

Gerber, Green & Larimer (2008): Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment
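Under random assignment, the analysis of this design is a difference in means per treatment arm. The turnout rates below approximate those Gerber, Green & Larimer report, but treat the exact numbers as illustrative; the data are simulated, not the actual voter file:

```python
import numpy as np

rng = np.random.default_rng(6)

# Approximate per-arm turnout rates (illustrative, simulated data).
rates = {"Control": 0.297, "Civic Duty": 0.315, "Hawthorne": 0.322,
         "Self": 0.345, "Neighbors": 0.378}
n_per_arm = 38000
turnout = {arm: rng.binomial(1, p, n_per_arm) for arm, p in rates.items()}

# Randomization makes each difference in means an unbiased ATE estimate.
control_mean = turnout["Control"].mean()
effects = {arm: turnout[arm].mean() - control_mean
           for arm in rates if arm != "Control"}
for arm, ate in effects.items():
    print(f"{arm}: {ate:+.3f}")  # effects grow with social pressure
```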

Overview of the Lecture

  • Modes of Statistical Inference
  • ‘Big Data’ and Causal Inference
  • Credibility Crisis in the Social Sciences

‘Big Data’ and Causal Inference

  • Lots of data does not remove the need for careful causal-inference design, but it can help us identify causal relationships

‘Big Data’ and Causal Inference

  • For example, in a regression discontinuity design (RDD), a large dataset gives us many observations close to the threshold, which helps us identify causal relationships with precision

Example:
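A minimal sharp-RDD sketch on synthetic data (all numbers are invented). With "big data", even a narrow bandwidth around the cutoff leaves thousands of observations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Sharp RDD: units with a running variable (e.g. vote margin) at or
# above zero receive treatment; the true jump at the cutoff is 2.0.
n = 200000
running = rng.uniform(-1, 1, n)
treated = (running >= 0).astype(float)
y = 1.5 * running + 2.0 * treated + rng.normal(0.0, 1.0, n)

# Local linear regression within a narrow bandwidth around the cutoff.
h = 0.05
near = np.abs(running) < h
X = np.column_stack([np.ones(near.sum()), running[near], treated[near]])
rdd_effect = np.linalg.lstsq(X, y[near], rcond=None)[0][2]
print(round(rdd_effect, 2))  # close to the true jump of 2.0
```

With a small sample, a 0.05 bandwidth would leave too few units near the cutoff; abundant data is what makes this local comparison well powered.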

‘Big Data’ and Causal Inference

  • Despite the potential for big data to help us identify causal relationships, there are still challenges
  • Making causal inferences requires exogenous variation
  • Computational techniques cannot replace “actual people doing actual thinking about social phenomena.” (Ashworth, Berry, and De Mesquita 2015)

Some Potentially Promising Avenues for Causal Inference

  • Finding instrumental variables in large datasets with language models (Han 2024)
  • Finding instruments and control variables that meet certain conditions with machine learning (Apfel et al. 2024)


Note: These methods are still in the early stages of development and require further research and validation.
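However an instrument is proposed, whether hand-picked or discovered by the LLM/ML searches cited above, the estimation step is standard. A minimal instrumental-variables sketch on synthetic data (the model and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(4)

# z is a valid instrument, u an unobserved confounder; the true causal
# effect of x on y is 1.0.
n = 50000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)        # endogenous regressor
y = 1.0 * x + 2.0 * u + rng.normal(size=n)

ols_est = (x @ y) / (x @ x)  # naive OLS, biased upward by u
iv_est = (z @ y) / (z @ x)   # Wald/IV estimate, consistent for 1.0
print(round(ols_est, 2), round(iv_est, 2))
```

The methods above automate the search for a z satisfying relevance and exclusion; they do not change what happens once one is found.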

Overview of the Lecture

  • Modes of Statistical Inference
  • ‘Big Data’ and Causal Inference
  • Credibility Crisis in the Social Sciences

Credibility Crisis in the Social Sciences

We often hear about …

  • Replication crisis - studies fail to replicate (psych, econ, polisci, medicine, etc.)
  • Publication bias - published studies only represent fraction of results, biased toward statistically significant findings
  • P-hacking/researcher degrees of freedom - published studies use only a fraction of possible specifications, biased toward significance
  • Misconduct/fraud - fabrication, falsification, plagiarism, etc.
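P-hacking is easy to demonstrate by simulation: with no true effect anywhere, testing many outcomes and reporting any p < .05 manufactures "findings" far more often than the nominal 5%. A sketch on pure noise:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)

# Each "study" tests 20 pure-noise outcomes against an arbitrary
# treatment split and declares success if ANY p-value falls below .05.
def any_significant(n_tests=20, n=100):
    x = rng.binomial(1, 0.5, n)       # arbitrary "treatment" assignment
    for _ in range(n_tests):
        y = rng.normal(size=n)        # outcome unrelated to x
        diff = y[x == 1].mean() - y[x == 0].mean()
        se = np.sqrt(y[x == 1].var(ddof=1) / (x == 1).sum()
                     + y[x == 0].var(ddof=1) / (x == 0).sum())
        # two-sided p-value, normal approximation to the t-test
        p = 2 * (1 - 0.5 * (1 + erf(abs(diff) / se / sqrt(2))))
        if p < 0.05:
            return True
    return False

false_positive_rate = np.mean([any_significant() for _ in range(500)])
print(round(false_positive_rate, 2))  # near 1 - 0.95**20 ≈ 0.64, not 0.05
```

Pre-registration addresses exactly this: committing to the outcomes and specifications in advance removes the multiplicity that inflates the false-positive rate.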

Credibility Crisis in the Social Sciences

Why do we have this credibility crisis?

  • Incentives - publish or perish, tenure, grants, etc.
  • Norms - publish only significant results, not negative results
  • Training - lack of training in research design, statistics, etc.
  • Technology - ease of data manipulation, p-hacking, etc.
  • Competition - limited resources, limited space in journals, etc.
  • Complexity - social phenomena are complex, difficult to measure, etc.
  • File Drawer Problem - studies that don’t find significant results are less likely to be published, leading to publication bias

Credibility Crisis in the Social Sciences

Fanelli (2010 & 2011)

Credibility Crisis in the Social Sciences

  • How can we address the credibility crisis?

    • Transparency - open data, open code, pre-registration, etc.
    • Reproducibility - replication studies, pre-registration, etc.
    • Open Science - open access, open data, open code, etc.

References

Apfel, Nicolas, Julia Hatamyar, Martin Huber, and Jannis Kueck. 2024. “Learning Control Variables and Instruments for Causal Analysis in Observational Data.” arXiv Preprint arXiv:2407.04448.
Ashworth, Scott, Christopher R Berry, and Ethan Bueno De Mesquita. 2015. “All Else Equal in Theory and Data (Big or Small).” PS: Political Science & Politics 48 (1): 89–94.
Bonica, Adam. 2014. “Mapping the Ideological Marketplace.” American Journal of Political Science 58 (2): 367–386.
Fanelli, Daniele. 2010. “‘Positive’ Results Increase Down the Hierarchy of the Sciences.” PLoS ONE 5 (4): e10068.
Fanelli, Daniele. 2011. “Negative Results Are Disappearing from Most Disciplines and Countries.” Scientometrics 90 (3): 891–904.
Gerber, Alan S, and Donald P Green. 2000. “The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment.” American Political Science Review 94 (3): 653–663.
Gerber, Alan S, Donald P Green, and Christopher W Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review 102 (1): 33–48.
Han, Sukjin. 2024. “Mining Causality: AI-Assisted Search for Instrumental Variables.” arXiv Preprint arXiv:2409.14202.
Nickerson, David W, and Todd Rogers. 2014. “Political Campaigns and Big Data.” Journal of Economic Perspectives 28 (2): 51–74.